library(tidyverse)
library(openintro)
library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 4th Edition. You can read this by typing
## vignette('os4') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
library(psych)

Dairy Queen's calories from fat are closer to a normal distribution than McDonald's. Both distributions are unimodal and right-skewed, but McDonald's has the more pronounced right skew: most of its values cluster at the low end, while its maximum is nearly double Dairy Queen's.

fastfood
## # A tibble: 515 × 17
## restaurant item calories cal_fat total_fat sat_fat trans_fat cholesterol
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Artisan … 380 60 7 2 0 95
## 2 Mcdonalds Single B… 840 410 45 17 1.5 130
## 3 Mcdonalds Double B… 1130 600 67 27 3 220
## 4 Mcdonalds Grilled … 750 280 31 10 0.5 155
## 5 Mcdonalds Crispy B… 920 410 45 12 0.5 120
## 6 Mcdonalds Big Mac 540 250 28 10 1 80
## 7 Mcdonalds Cheesebu… 300 100 12 5 0.5 40
## 8 Mcdonalds Classic … 510 210 24 4 0 65
## 9 Mcdonalds Double C… 430 190 21 11 1 85
## 10 Mcdonalds Double Q… 770 400 45 21 2.5 175
## # … with 505 more rows, and 9 more variables: sodium <dbl>, total_carb <dbl>,
## # fiber <dbl>, sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>,
## # calcium <dbl>, salad <chr>
summary(fastfood)
## restaurant item calories cal_fat
## Length:515 Length:515 Min. : 20.0 Min. : 0.0
## Class :character Class :character 1st Qu.: 330.0 1st Qu.: 120.0
## Mode :character Mode :character Median : 490.0 Median : 210.0
## Mean : 530.9 Mean : 238.8
## 3rd Qu.: 690.0 3rd Qu.: 310.0
## Max. :2430.0 Max. :1270.0
##
## total_fat sat_fat trans_fat cholesterol
## Min. : 0.00 Min. : 0.000 Min. :0.000 Min. : 0.00
## 1st Qu.: 14.00 1st Qu.: 4.000 1st Qu.:0.000 1st Qu.: 35.00
## Median : 23.00 Median : 7.000 Median :0.000 Median : 60.00
## Mean : 26.59 Mean : 8.153 Mean :0.465 Mean : 72.46
## 3rd Qu.: 35.00 3rd Qu.:11.000 3rd Qu.:1.000 3rd Qu.: 95.00
## Max. :141.00 Max. :47.000 Max. :8.000 Max. :805.00
##
## sodium total_carb fiber sugar
## Min. : 15 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 800 1st Qu.: 28.50 1st Qu.: 2.000 1st Qu.: 3.000
## Median :1110 Median : 44.00 Median : 3.000 Median : 6.000
## Mean :1247 Mean : 45.66 Mean : 4.137 Mean : 7.262
## 3rd Qu.:1550 3rd Qu.: 57.00 3rd Qu.: 5.000 3rd Qu.: 9.000
## Max. :6080 Max. :156.00 Max. :17.000 Max. :87.000
## NA's :12
## protein vit_a vit_c calcium
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 16.00 1st Qu.: 4.00 1st Qu.: 4.00 1st Qu.: 8.00
## Median : 24.50 Median : 10.00 Median : 10.00 Median : 20.00
## Mean : 27.89 Mean : 18.86 Mean : 20.17 Mean : 24.85
## 3rd Qu.: 36.00 3rd Qu.: 20.00 3rd Qu.: 30.00 3rd Qu.: 30.00
## Max. :186.00 Max. :180.00 Max. :400.00 Max. :290.00
## NA's :1 NA's :214 NA's :210 NA's :210
## salad
## Length:515
## Class :character
## Mode :character
##
##
##
##
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
describe(mcdonalds$cal_fat)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 57 285.61 220.9 240 248.94 118.61 50 1270 1220 2.27 6.34
## se
## X1 29.26
hist(mcdonalds$cal_fat)
describe(dairy_queen$cal_fat)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 42 260.48 156.49 220 245.88 126.02 0 670 670 0.91 0.36
## se
## X1 24.15
hist(dairy_queen$cal_fat)

The distribution is very nearly normal, but it is not exactly symmetric. There are more values in the lowest quarter of the range than in the highest quarter, whereas in a normal distribution the two tails would hold equal, and nearly negligible, shares of the data.

dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
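
The `stat_bin()` message above asks for an explicit bin width. A minimal sketch of how that could look, assuming ggplot2 3.3.0 or later (for `after_stat()`). The data frame `demo` is simulated for illustration and is not the Dairy Queen data, and the bin width of 50 is an arbitrary choice, not a recommendation:

```r
library(ggplot2)

# Hypothetical stand-in data: 42 simulated values, NOT the real cal_fat column.
set.seed(1)
demo <- data.frame(cal_fat = rnorm(42, mean = 260, sd = 156))
demo_mean <- mean(demo$cal_fat)
demo_sd <- sd(demo$cal_fat)

# Density-scaled histogram with an explicit bin width, overlaid with the
# normal curve that shares the sample mean and standard deviation.
p <- ggplot(demo, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50) +
  stat_function(fun = dnorm,
                args = list(mean = demo_mean, sd = demo_sd),
                colour = "tomato")
print(p)
```

Setting `binwidth` silences the message, and trying a few widths shows how much the apparent shape depends on that choice.
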

The simulated plots are similar to the real one and seem within a reasonable margin of error. However, the simulations reach lower values between the -2 and -1 theoretical quantiles than the real data does, so the real data has a shorter lower tail than a normal distribution would, while most of its higher values fall on the trend line.

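
To make the quantile comparison concrete, here is a rough sketch of how the points of a normal probability plot are formed in base R: sorted sample values are paired with theoretical standard normal quantiles. The sample `x` is simulated for illustration (not the Dairy Queen data), and the reference line drawn here uses the sample mean and standard deviation, which differs from `qqline()`, which passes through the quartiles.

```r
# Hypothetical sample, simulated for illustration only.
set.seed(37)
x <- rnorm(42, mean = 260, sd = 156)
n <- length(x)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities.
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the observed values in increasing order.
sample_q <- sort(x)

# qqnorm() computes the same pairs; plot.it = FALSE returns them without drawing.
ref <- qqnorm(x, plot.it = FALSE)

# Draw the plot by hand. For normal data the points fall near mean + sd * z.
plot(theoretical, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(a = mean(x), b = sd(x), col = "tomato")
```

Points below the line at the left end mean the sample reaches lower values than the normal model predicts there, and points above it mean the lower tail is shorter.
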
set.seed(37)
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
qqnormsim(sim_norm)
hist(sim_norm)

Most points in both the real data set and the simulated sets fall close to their respective trend lines. The histogram supports this: its shape very nearly matches a bell curve that tapers off at either end.

qqnormsim(dairy_queen$cal_fat)
hist(dairy_queen$cal_fat)

The McDonald's probability plots are weighted toward the lower values, and the highest values often fall off the trend line. In the histogram, a right skew is very evident.

qqnormsim(mcdonalds$calories)
hist(mcdonalds$cal_fat)

Dairy Queen is the best candidate for analyzing sugar and calcium since it is primarily a dairy dessert destination. I expect that the probability of buying an item that exceeds half of one's recommended daily sugar intake (12 grams) will be very high.

sugarDailyServing <- 12
sugarMean <- mean(dairy_queen$sugar)
sugarSd <- sd(dairy_queen$sugar)
pnorm(sugarDailyServing, sugarMean, sugarSd)
## [1] 0.8692298
dairy_queen %>%
  filter(sugar < sugarDailyServing) %>%
  summarise(n() / nrow(dairy_queen))
## # A tibble: 1 × 1
## `n()/nrow(dairy_queen)`
## <dbl>
## 1 0.905
1 - pnorm(sugarDailyServing, sugarMean, sugarSd)
## [1] 0.1307702
dairy_queen %>%
  filter(sugar > sugarDailyServing) %>%
  summarise(n() / nrow(dairy_queen))
## # A tibble: 1 × 1
## `n()/nrow(dairy_queen)`
## <dbl>
## 1 0.0952
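
The comparison above sets a theoretical probability from the normal model (about 0.131) against an empirical proportion from the data (0.0952). A small base R sketch of the same two calculations, using a made-up `sugar` vector rather than the Dairy Queen data, and showing that `pnorm()` with a mean and standard deviation is equivalent to standardizing to a z-score first:

```r
# Hypothetical sugar values in grams, for illustration only.
sugar <- c(2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 14, 30)
threshold <- 12

m <- mean(sugar)
s <- sd(sugar)

# Theoretical: probability of exceeding the threshold under a normal model.
p_theoretical <- 1 - pnorm(threshold, mean = m, sd = s)

# The same probability from the z-score and the upper tail of the standard normal.
z <- (threshold - m) / s
p_from_z <- pnorm(z, lower.tail = FALSE)

# Empirical: the share of observed values above the threshold.
# mean() of a logical vector is the proportion of TRUE values.
p_empirical <- mean(sugar > threshold)

c(theoretical = p_theoretical, from_z = p_from_z, empirical = p_empirical)
```

A gap between the theoretical and empirical values, like the one in my results, suggests the normal model is only an approximation for this variable.
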
qqnormsim(dairy_queen$sugar)
hist(dairy_queen$sugar)

Burger King and Arby's have the sodium distributions that are closest to a normal distribution.

restaurantNames <- fastfood$restaurant %>%
  unique
restaurantNames
## [1] "Mcdonalds" "Chick Fil-A" "Sonic" "Arbys" "Burger King"
## [6] "Dairy Queen" "Subway" "Taco Bell"
for (x in seq_along(restaurantNames)) {
  restaurantName <- restaurantNames[x]
  restaurantSodium <- (fastfood %>%
    filter(restaurant == restaurantName))$sodium
  print(paste(restaurantName, "↓", sep = " "))
  qqnormsim(restaurantSodium)
  hist(restaurantSodium)
}
## [1] "Mcdonalds ↓"
## [1] "Chick Fil-A ↓"
## [1] "Sonic ↓"
## [1] "Arbys ↓"
## [1] "Burger King ↓"
## [1] "Dairy Queen ↓"
## [1] "Subway ↓"
## [1] "Taco Bell ↓"

The stepwise pattern in the normal probability plots indicates that sodium values repeat at a limited number of distinct levels, rather than taking many unique values that would better smooth the curve. That might indicate that salt is added or marketed in more controlled and consistent increments.

summary(fastfood$sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15 800 1110 1247 1550 6080
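
One way to check the stepwise explanation is to count how many distinct values a variable takes and how often values repeat. A minimal base R sketch, using a made-up `sodium` vector rather than the fastfood data:

```r
# Hypothetical sodium values in mg, for illustration only.
sodium <- c(800, 800, 810, 1110, 1110, 1110, 1250, 1550, 1550, 2000)

# Few distinct values relative to the number of observations means many ties,
# and ties appear as flat steps in a normal probability plot.
n_values <- length(sodium)
n_distinct_values <- length(unique(sodium))

# Share of values that are exact multiples of 10, a sign of rounded amounts.
share_multiple_of_10 <- mean(sodium %% 10 == 0)

# The most frequently repeated values.
ties <- sort(table(sodium), decreasing = TRUE)
head(ties, 3)
```

The same calls could be applied to `fastfood$sodium` to see whether the real values behave this way.
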

Judging by the normal probability plot, the data set is right-skewed: the points crowd toward the low end near 0, which corresponds to the values along the left side of the histogram. The histogram confirms the skew: over its overall range of 0-140, most values fall in the low range of 20-50.

qqnormsim(dairy_queen$total_carb)
hist(dairy_queen$total_carb)
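
The direction of a skew can also be checked numerically rather than read off the plots. A rough base R sketch with a made-up vector `x` (not the Dairy Queen carbohydrate data), using one common definition of sample skewness, the mean cubed deviation divided by the cubed standard deviation:

```r
# Hypothetical right-skewed values, for illustration only.
x <- c(20, 25, 30, 30, 35, 40, 45, 50, 60, 140)

# Sample skewness: positive for a longer right tail, negative for a longer left tail.
skewness <- mean((x - mean(x))^3) / sd(x)^3

# A second quick check: in right-skewed data the mean is usually pulled above the median.
mean_minus_median <- mean(x) - median(x)

c(skewness = skewness, mean_minus_median = mean_minus_median)
```

The `describe()` function from psych, used earlier in this lab, reports a `skew` column that can serve the same purpose for the real data.
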